Table driven fault recovery system with redundancy and priority handling

ABSTRACT

A hierarchal organization of registers containing error indicators is utilized to determine which of a plurality of error recovery modules is to be executed. A redundancy mask and a priority mask stored in a data table associated with each register are sequentially applied to each register in order to eliminate redundant error indicators resulting from a common fault and to set the order of execution of error recovery modules. Thus, the redundancy and priorities associated with detected errors can be controlled substantially independent of the actual error recovery actions to be taken.

BACKGROUND OF THE INVENTION

This invention is directed generally to the handling of errors or faults in a computer system operating under the control of a software program. More specifically, this invention addresses how different error recovery actions are selected, especially when multiple concurrent faults are detected.

Error handling methods in a microprocessor environment normally utilize the interrupt vectors supported by the microprocessor. Upon receipt of an interrupt vector associated with detection of a fault, the microprocessor interrupts the currently executing program and executes an alternative program corresponding to the interrupt vector. Faults can be detected by known software and hardware detection techniques. In a typical error handling method, the type and source of the fault determine the error handling process to be executed by selecting the address associated with the desired error handling routine.

Error handling techniques are important in large complex systems which operate under software control since hardware devices and different programs are being concurrently utilized. Error handling and recovery techniques become critical in systems where uninterruptible service must be provided, such as in a telecommunications switch environment. In a known recovery technique used in complex systems, error recovery routines and a selection routine that controls which recovery routine to execute have been combined into a sequentially coded, integral error handling system. Such a technique performs best when faults occur sequentially so that only a single fault has to be dealt with at a time. However, in complex systems concurrent faults occur and error handling priorities must be assigned to resolve the order in which the errors are addressed. When a change is needed in the order in which concurrent errors are to be handled, the priorities must be amended. The sequentially coded, integral error handling system must then be carefully reviewed and tested to ensure that changes to the order of handling errors have not introduced an error condition in the error handling system itself. Unneeded error handling routines are often executed where the handling of one concurrent error is sufficient to eliminate a fault which gave rise to other concurrent errors.

There exists a need for an improved error handling system that allows changes to be made in the order of handling concurrent errors with a minimum of testing. A need also exists for an error handling system which allows an error to be ignored where the handling of another concurrent error is sufficient to address the fault associated with the ignored error.

SUMMARY OF THE INVENTION

It is an object of the present invention to provide an error recovery method and apparatus which address the above needs.

In accordance with an embodiment of the present invention, a plurality of registers are arranged in a hierarchal relationship and contain error indicators generated by a plurality of devices and processes. A data table associated with each register contains supervisory data for each position in the register. Preferably, each supervisory data entry includes a redundancy mask, a priority mask, and an address vector that addresses another register or error recovery module. The redundancy mask acts as a filter that cancels one or more concurrent error indicators in view of the recovery action to be taken for another concurrently set error indicator. The redundancy mask eliminates the execution of unnecessary recovery actions where a single fault leads to the generation of a plurality of concurrent error indicators. The priority mask determines the order in which nonredundant concurrent error indicators are to be processed.

Error redundancies and priorities can be established and changed substantially independent of the actual error recovery actions utilized to overcome the detected faults. This technique leads to improved error handling efficiencies, especially where changes in redundancy or priorities must be made.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a computer controlled system which incorporates an embodiment of the present invention.

FIG. 2 illustrates the hierarchal relationship of error handling registers and error recovery modules in accordance with an embodiment of the present invention.

FIG. 3 illustrates a data table associated with register SB shown in FIG. 2.

FIG. 4 is a flow diagram of steps in accordance with an embodiment of the present invention illustrating use of the redundancy and priority masks shown in FIG. 3.

DETAILED DESCRIPTION

FIG. 1 illustrates a software controlled system 10 which incorporates an embodiment of the present invention. System 10 includes a central control device 12 and a plurality of peripheral devices 14 which communicate with device 12 by communication channels 16 which may comprise address and data buses or other known communication methods. Illustrative device 12 includes a microprocessor (MPU) 18, read-only memory (ROM) 20, random access memory (RAM) 22, floppy drive 24, hard drive 26, and interrupt registers 28. Each of elements 20-28 communicates with MPU 18 as will be understood by those skilled in the art. Devices 14 communicate detected fault conditions by means of an error signal transmitted to interrupt registers 28 by paths 30. An illustrative path 32 couples error signals generated by internal devices and processes of device 12 to interrupt registers 28. In a typical configuration each significant hardware device and software program will include fault detection utilized to generate an error signal representative of the detected fault.

FIG. 2 illustrates an exemplary embodiment of interrupt registers 28 arranged in a hierarchal relationship. Register R, which is capable of storing bits a-n, represents the top register in the illustrated arrangement. Each bit in register R is coupled to either a lower level register or an error recovery module which represents a series of steps (actions) to be implemented by MPU 18 in an attempt to clear or overcome a detected fault. In the illustrative embodiment, register SB is associated with bit b of register R; other registers (not shown) are similarly associated with the other bits of register R and together with register SB comprise the S register set. Sets of registers T (TA, TB, . . . TN) represent the lowest set of registers; each register in the T register set is associated with a bit of a corresponding register S. Illustrated registers TA, TD, and TN are associated with bits a, d, and n of register SB. Associated with each of the bits in the T registers is a corresponding error recovery module as indicated; error recovery modules TA1, TA2, TA3 . . . TAN are associated with bits a, b, c . . . n of register TA, respectively, and similarly for module sets TD and TN relative to registers TD and TN. A shadow register is preferably associated with each of the registers; in the illustrative example shadow registers are shown for registers R and SB but are not shown for the registers T. The shadow registers are capable of storing at least the same number of bits associated with each corresponding register and are used in conjunction with the redundancy and priority masking described below.

In registers T, each bit identifies whether or not an error indicator corresponding to an error recovery module is set; a logic "1" indicates the error indicator is set, a logic "0" indicates the error indicator is not set. The bits in registers T are set upon receipt of error signals which indicate faults as detected by corresponding devices or processes. The bits in the higher level registers S and R indicate whether or not any lower level error indicators in the associated hierarchy have been set. In the illustrative example, bits a, d, and n in register SB are set since at least one bit in each of the corresponding registers TA, TD, and TN is set. Similarly, bit b in register R is set since one or more bits in register SB are set. The hierarchy of registers operates such that a bit becoming set in a lower level register propagates upwardly through the organization. The highest level register is coupled to MPU 18 and controls the initiation of microprocessor interrupts to handle a detected fault condition.
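The upward propagation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: registers are modeled as sets of bit letters, and the helper name `propagate` is hypothetical and does not appear in the described embodiment, where propagation may be performed by hardware.

```python
def propagate(children):
    """Return the parent bits that must be set.

    `children` maps a parent bit letter to the lower level register
    (a set of set bit letters) coupled to that bit. A parent bit is set
    when at least one bit in its lower level register is set.
    """
    return {bit for bit, child in children.items() if child}


# Illustrative state from FIG. 2: TA(b), TD(c) and TN(a) are set.
t_registers = {"a": {"b"}, "d": {"c"}, "n": {"a"}}

sb = propagate(t_registers)   # register SB
r = propagate({"b": sb})      # register R, whose bit b is coupled to SB
```

With the three T-level indicators of the illustrative example, this yields bits a, d, and n in register SB and bit b in register R.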

FIG. 3 illustrates an exemplary data table associated with register SB. A similar data table exists for each of the registers of FIG. 2. Data table rows SB(a)-SB(n) contain data corresponding to bits a-n, respectively, of register SB. A priority mask column and redundancy mask column each contain supervisory data bits a-n. A next address vector column is associated with each row and contains a vector which points to the next associated subordinate register routine or an associated error recovery module. The information contained in the table is predetermined and entered in accordance with the error handling design for the system. The purpose of the redundancy mask is to provide the capability to cancel or mask a set error indicator in the same register as another concurrently set error indicator. Such redundancy masking eliminates unnecessary error recovery steps where it has been predetermined that the occurrence of one or more particular error indicators concurrently with another error indicator can best be handled by not executing one or more error recovery modules, such as where a common fault source gives rise to multiple fault detections. The purpose of the priority mask is to determine the order in which multiple, nonredundant error indicators in the same register are to be handled. After a determination has been made as to which set error indicator is to be handled first, its associated next address vector is executed. After one lower level register T has been cleared by execution of error recovery modules corresponding to set indicators, control returns to the register SB. The nonredundant error indicator in register SB with the highest priority then causes recovery execution to continue as directed by the associated next address vector. This process continues until all error indicators in the register have been cleared, upon which control passes to the next higher level register. Thus, all zero bits in the highest level register indicate that all lower level registers have been cleared and error recovery module processes executed.
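One possible in-memory layout for such a data table is sketched below. The class name `TableRow` and its field names are hypothetical. The SB(a) masks and the all-zero SB(d) redundancy mask follow the worked example given later in this description; the SB(d) priority mask and both SB(n) masks are not stated in the text and are assumed empty here purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableRow:
    redundancy_mask: frozenset  # bits to cancel when this row's bit is set
    priority_mask: frozenset    # bits that must wait behind this row's bit
    next_vector: str            # subordinate register or recovery module


SB_TABLE = {
    "a": TableRow(frozenset("n"), frozenset("cd"), "TA"),
    # Assumption: the masks below are empty; only SB(d)'s redundancy
    # mask is stated (all zeros) in the description.
    "d": TableRow(frozenset(), frozenset(), "TD"),
    "n": TableRow(frozenset(), frozenset(), "TN"),
}
```

Because the masks and vectors are plain data, changing a redundancy or priority relationship amounts to editing a table entry rather than the recovery code.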

FIG. 4 is a flow diagram of error recovery steps in accordance with an embodiment of the present invention. These steps are preferably implemented in software associated with MPU 18 and are executed for each of the registers upon the MPU receiving an initial error recovery interrupt. The redundancy processing starts with BEGIN 40 where variable X is set to the state of bit a of the register being processed. A specific illustrative example will be described below with regard to register SB. In step 44 a determination is made if X is set, i.e. if X=1. Upon a YES determination, the redundancy mask stored in the table for the corresponding bit is read in step 46. The read redundancy mask is applied to the contents of the register being processed in step 48. The redundancy mask associated with each set bit of a register is applied by applying bits a-n of the redundancy mask to corresponding bits a-n in the register. A "0" bit in the redundancy mask indicates no change of state is to be made to the corresponding bit in the register; a "1" bit in the redundancy mask causes the corresponding bit in the register to change to a "0" state if it was previously in a "1" state. Thus, logic 1 bits in the redundancy mask reset any corresponding bits in the register. In step 50 a determination is made if variable X has been set to the state of bit n, i.e. has each bit in the subject register been tested by step 44? Upon a NO determination, variable X is incremented in step 52 by setting X equal to the state of the next bit, i.e. set X=state of b if it was previously set to the state of a. Control is then returned to step 44, which processes each set bit in accordance with steps 46 and 48. A NO determination by step 44, which indicates the corresponding bit is not set, results in control passing directly to step 50, thereby bypassing steps 46 and 48. A YES determination by step 50 indicates that each of the bits in the register has been processed; control then passes to another part of this method. Therefore, a corresponding redundancy mask for each of the set bits in a register is applied to the contents of the register.
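The redundancy pass can be sketched with integer bitmasks as follows. This is a minimal illustration, not the patented implementation: it assumes bit a is the least significant bit, uses a 4-bit register for brevity, and the function name `apply_redundancy` is hypothetical. It also assumes a row's redundancy mask never has that row's own bit set.

```python
def apply_redundancy(register, redundancy_masks, width):
    """Apply the redundancy mask of every set bit, in order a..n.

    Each bit is tested against the current register contents, so a bit
    already reset by an earlier mask does not get its own mask applied.
    """
    for pos in range(width):
        if register & (1 << pos):             # is this bit set?
            register &= ~redundancy_masks[pos]  # "1" mask bits reset bits
    return register


# Bits a, c and d set; the mask for bit a marks bit d as redundant.
masks = [0b1000, 0b0000, 0b0000, 0b0000]
result = apply_redundancy(0b1101, masks, width=4)
```

Here `result` is `0b0101`: bit d has been cancelled, while bits a and c remain set for the priority processing that follows.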

The priority processing starts with step 54 where the contents of the register being processed by the method are copied to its corresponding shadow register. It should be noted that the register contents are copied following the application of the redundancy mask so that the contents of the shadow register may not be identical to the original contents of its associated register. In step 56 variable Y is set equal to the state of bit a of the corresponding shadow register. In step 58 a determination is made as to whether Y is set. Upon a YES determination, the priority mask for bit Y is read in step 60. The priority mask is applied to the data in the shadow register in step 62. The priority mask corresponding to each set bit in the shadow register is applied to the contents of the shadow register by comparing the priority mask bits a . . . n to the corresponding bits a . . . n of the shadow register. A logic 1 in a bit in the priority mask causes the corresponding bit in the shadow register to be reset, if set; a "0" in a bit in the priority mask has no effect on the state of the corresponding bit in the shadow register. In step 64 a determination is made if variable Y has been loaded with the state of bit n, i.e. have all the bits in the shadow register been tested by step 58? Upon a NO determination, Y is incremented to be set to the state of the next bit by step 66 and control is passed to step 58. Thus, for each set bit in the shadow register, a corresponding priority mask is applied, potentially causing certain of the bits set in the shadow register to be reset, thereby giving priority of execution order to the indicators (bits) remaining set in the shadow register.
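A comparable sketch of the priority pass is given below. The point it illustrates is that the masks act on the shadow copy only, so lower priority indicators stay set in the register itself and are picked up on a later pass. As before, the bit ordering (bit a as least significant), the 4-bit width, and the name `apply_priority` are assumptions made for illustration.

```python
def apply_priority(register, priority_masks, width):
    """Return the shadow register after priority masking (steps 54-66)."""
    shadow = register                       # step 54: copy to shadow
    for pos in range(width):                # steps 56-66: walk bits a..n
        if shadow & (1 << pos):             # step 58: is the bit set?
            shadow &= ~priority_masks[pos]  # steps 60-62: defer masked bits
    return shadow


# Bits a and c set; bit a takes priority over bits c and d.
register = 0b0101
priority_masks = [0b1100, 0b0000, 0b0000, 0b0000]
shadow = apply_priority(register, priority_masks, width=4)
```

After this call `shadow` holds only bit a, while `register` still holds bits a and c, so bit c is handled once bit a has been cleared.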

A YES determination in step 64 indicates that each of the set bits in the shadow register has had a corresponding priority mask applied to the contents of the shadow register. Thus, error indicators now contained in the shadow register have been modified to implement redundancy and priority in accordance with predetermined corresponding criteria determined by the table. In step 68 the next address vector is executed for each bit set in the shadow register. If the vector is to a subordinate register, a series of steps similar to that shown in FIG. 4 will be applied to the subordinate register in accordance with data contained in the subordinate register and its corresponding table. Should more than one bit be set in the shadow register following the redundancy and priority routines, the corresponding vectors will be executed in a predetermined manner such as a top-to-bottom order as viewed in the table. Step 68 acts as a "call" function to call either subordinate registers or error recovery modules, each of which will return control following execution to step 70. In step 70, bits in the register are cleared which correspond to set bits in the shadow register. That is, bits in the register which correspond to bits in the shadow register that have been executed by step 68 are cleared in preparation for further prioritization. In step 72 a determination is made if any bits remain set in the register. A YES determination causes control to return to step 54, wherein the modified contents of the register are copied to the shadow register with prioritization again determined by steps 54-66. A NO determination by step 72 indicates that all bits have been processed and hence these steps end at RETURN 74. This permits control to be passed to a similar program associated with a superior register, i.e. a register higher in the hierarchy.

The following example of processing in accordance with the present invention utilizes the states and data shown in FIGS. 2 and 3. In the illustrative example, three concurrent faults have been identified by corresponding error signals as indicated by registers TA(b), TD(c) and TN(a). The logic 1s associated with each are rippled upward in the hierarchy to bits a, d, and n of register SB, which in turn causes a logic 1 in register R(b). The example illustrates processing of register SB. It should be noted that the processing of register R in this example represents a relatively trivial case in which no redundancy or conflict of priority exists since only one error indicator is set. The processing in accordance with the present invention begins with the highest level register and proceeds down in the hierarchy until each set error indicator has been addressed.

Register SB is processed by applying the redundancy and priority masks shown in FIG. 3 in accordance with the steps shown in FIG. 4. Since only bits a, d, and n are initially set in register SB, only the corresponding redundancy masks SB(a), SB(d), and SB(n) can be considered for application in accordance with steps 42-52. For SB register bit a, the SB(a) redundancy mask indicates a redundancy only for bit n. Thus applying the SB(a) redundancy mask causes bit n of register SB to be reset from 1 to 0. Application of the SB(d) redundancy mask results in no change since the redundancy mask indicates no corresponding redundancies, i.e. all bits have a 0 state in the mask. The SB(n) redundancy mask is not applied since the earlier SB(a) redundancy mask application caused bit n in the SB register to be reset to 0. Thus upon application of step 54 to register SB, the SB shadow register bits will have the states as shown in FIG. 2 based upon application of the redundancy masks as shown in FIG. 3 to the original states of register SB.

Steps 56-66 apply the priority masks of FIG. 3 to the contents of the SB shadow register as shown in FIG. 2 in which only bits a and d are set. Thus, only the priority masks for SB(a) and SB(d) can be considered for application to the contents of the SB shadow register. Only bits c and d in the SB(a) priority mask are set, thereby indicating that the error recovery module associated with bit a should be executed prior to the execution of error recovery modules associated with either bits c or d. Since bit d is set in the illustrated example, application of the SB(a) priority mask to the SB shadow register will cause bit d to be reset to 0. Thus, the priority mask for SB(d), which would have been applied if shadow register bit d had remained set, will not be applied in view of bit d having been reset due to the application of priority mask SB(a). Therefore, upon entering step 68 only SB shadow register bit a will be set, causing execution to transfer to the next address vector TA associated with SB(a). Since only bit b in register TA is set, the redundancy and prioritization in accordance with the steps of FIG. 4 associated with register TA are trivial, resulting in the execution of error recovery module TA2. Following the execution of module TA2, return passes to step 70 in the program associated with register TA wherein bit b is reset. Since no bits remain set in register TA, control then returns to the program associated with register SB at step 70 as shown in FIG. 4. In accordance with step 70, only bit a of the SB shadow register is set and hence register SB bit a is reset, thereby leaving only bit d set in register SB (it will be remembered that originally set bit n in register SB was reset in accordance with the redundancy mask earlier). In step 72 a YES determination is made in view of bit d being set, and return passes to step 54 which recopies the current status of register SB to the SB shadow register, wherein the SB shadow register has only bit d set.

Since only one bit is set, the prioritization mask execution is trivial and results in step 68 passing control to register TD pursuant to the next address vector of SB(d). Since only bit c of register TD is set, the redundancy and prioritization program associated with register TD will cause the execution of error recovery module TD3, resetting of bit c in register TD, and return of control to the program associated with register SB at step 70. The remaining set bit d in register SB is reset according to step 70, thereby causing a NO determination in step 72. Control passes back to the program associated with register R. Since all bits in register SB will have been cleared to 0, the bit b in register R is reset to 0.
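The walk-through above can be reproduced with a short Python sketch of the FIG. 4 flow applied recursively down the hierarchy. It is an illustration under several assumptions: registers are modeled as sets of bit letters, the names `TABLES`, `REGISTERS`, `process`, and `executed` are hypothetical, mask values not stated in the text are taken to be empty, and the module name TN1 is inferred by analogy with TA1 and is never executed here.

```python
BITS = "abcdefghijklmn"

# Row layout: (redundancy mask, priority mask, next address vector).
TABLES = {
    "R": {"b": (set(), set(), "SB")},
    "SB": {
        "a": ({"n"}, {"c", "d"}, "TA"),
        "d": (set(), set(), "TD"),   # assumed empty priority mask
        "n": (set(), set(), "TN"),   # assumed empty masks
    },
    "TA": {"b": (set(), set(), "TA2")},
    "TD": {"c": (set(), set(), "TD3")},
    "TN": {"a": (set(), set(), "TN1")},
}

# Initial indicator states from the illustrative example.
REGISTERS = {
    "R": {"b"},
    "SB": {"a", "d", "n"},
    "TA": {"b"},
    "TD": {"c"},
    "TN": {"a"},
}

executed = []  # error recovery modules, in order of execution


def process(name):
    register = REGISTERS[name]
    table = TABLES[name]
    # Redundancy pass: apply the mask of each bit that is still set.
    for bit in BITS:
        if bit in register:
            register -= table[bit][0]
    while register:                      # step 72: any bits still set?
        shadow = set(register)           # step 54: copy to shadow
        for bit in BITS:                 # steps 56-66: priority pass
            if bit in shadow:
                shadow -= table[bit][1]
        for bit in BITS:                 # step 68: top-to-bottom order
            if bit in shadow:
                target = table[bit][2]
                if target in REGISTERS:
                    process(target)      # subordinate register
                else:
                    executed.append(target)  # error recovery module
        register -= shadow               # step 70: clear handled bits


process("R")
```

Running this executes module TA2 and then module TD3, and leaves registers R, SB, TA, and TD cleared. The TN indicator is left untouched in this sketch because bit n of register SB was masked as redundant; the description does not state how such a lower level indicator is subsequently cleared.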

Because the redundancy and priority masks are independently controllable relative to the error recovery actions, it is easy to make changes to the redundancy between concurrently detected faults and establish different priorities for the order of execution of error recovery modules. This minimizes the need for additional testing of the fault recovery system to verify that the system will operate properly following the change.

It will be apparent to those skilled in the art that the fault recovery system in accordance with the present invention could be implemented utilizing only the priority mask or the redundancy mask. However, it is believed that the combination of both the redundancy and priority masks provides increased control while maintaining flexibility for later changes in controlling the fault recovery system. It will also be understood that the illustrative registers and shadow registers can be implemented as defined memory locations in a RAM as well as in conventional registers.

Although an embodiment of the present invention has been described herein and shown in the drawings, the scope of the invention is defined by the claims which follow.

We claim:
 1. A fault recovery arrangement for use in a computer controlled system, said arrangement receiving error signals corresponding to the detection of faults, the arrangement comprising: register means for storing indicators corresponding to the receipt of said error signals; table means for storing predetermined supervisory data corresponding to said indicators; means operating under the control of a program for applying said supervisory data to said indicators stored in the register means to select ones of said indicators; and means operating under the control of a program for executing predetermined error recovery actions corresponding to said selected indicators.
 2. The arrangement according to claim 1 wherein said table means stores one group of supervisory data that defines which, if any, set indicators are redundant to other concurrently set indicators, said set indicators corresponding to associated error signals being received.
 3. The arrangement according to claim 2 wherein said applying means applies said one group of supervisory data to said indicators to mask redundant indicators, said masked indicators not being selected to initiate corresponding error recovery actions.
 4. The arrangement according to claim 1 wherein said table means stores one group of supervisory data that defines the priority of said indicators and hence the order of execution of the corresponding error recovery actions.
 5. The arrangement according to claim 1 wherein said table means stores first and second groups of supervisory data that define the redundancy, if any, of said indicators and the relative priority of said indicators, respectively.
 6. The arrangement according to claim 5 wherein said applying means applies said first group of supervisory data to said indicators to eliminate any redundant indicators as determined by said first group of data and then applies said second group of supervisory data to said indicators to determine the order of execution of associated error recovery actions.
 7. The arrangement according to claim 1 wherein said register means comprises a lower group of memory storage registers and a higher memory storage register coupled to said lower group of registers, each memory storage register in said lower set being associated with an indicator in said higher memory storage register.
 8. The arrangement according to claim 7 wherein each memory storage register in said lower group includes a plurality of said indicators, at least one of the indicators in a lower group register representing receipt of an error signal, said at least one indicator being required for the corresponding indicator in said higher memory storage register to indicate receipt of an error signal.
 9. The arrangement according to claim 7 wherein each indicator in the memory storage registers of said lower group is associated with one of said predetermined error recovery actions.
 10. The arrangement according to claim 7 further comprising another lower group of memory storage registers and another higher memory storage register associated with said lower group of memory storage registers and higher memory storage register, respectively, said another set of memory storage registers and another higher memory storage register being utilized by said applying means to store modified indicators created by applying said supervisory data to the indicators stored in corresponding registers.
 11. The arrangement according to claim 1 wherein said table means stores a next address vector for each of said indicators, said vector identifying a next register means associated with each indicator.
 12. A fault recovery method for use in a computer controlled system which receives error signals corresponding to the detection of faults, the method comprising the steps of: storing indicators corresponding to the receipt of said error signals; storing predetermined supervisory data corresponding to said indicators; applying said supervisory data to said stored indicators to create a shadow set of indicators; executing predetermined error recovery actions corresponding to said shadow set of indicators.
 13. The method according to claim 12 wherein said storing step further comprises the step of storing one group of supervisory data that defines which, if any, of said indicators are redundant to other concurrent indicators, said applying step applying said one group of supervisory data to said stored indicators to modify the state of redundant indicators so that error recovery actions corresponding to said redundant indicators are not executed.
 14. The method according to claim 12 wherein said storing step further comprises the step of storing one group of supervisory data defining the relative priority of said indicators and hence defining the order of execution of said corresponding predetermined error recovery actions.
 15. The method according to claim 12 wherein said storing step further comprises the step of storing first and second groups of supervisory data that define the redundancy, if any, of said indicators and the relative priority of said indicators, respectively.
 16. The method according to claim 15 wherein said step of applying applies said first group of supervisory data to said indicators to eliminate redundant indicators as determined by said first group of data and then applies said second group of supervisory data to said indicators to determine the order of execution of associated error recovery actions corresponding to said indicators.
 17. The method according to claim 12 further comprising the steps of determining which, if any, of said indicators are redundant to other indicators, said shadow set of indicators not including redundant indicators, determining the relative priority of said shadow indicators, and executing the predetermined error recovery actions corresponding to the prioritized shadow indicators.
 18. A hierarchal fault recovery arrangement for use in a computer controlled system which receives error signals corresponding to the detection of faults, the arrangement comprising: a plurality of memory storage elements arranged in a hierarchal organization, each element storing indicators which indicate if a corresponding error signal has been received; a data table stores predetermined supervisory data corresponding to said indicators; means for creating a modified set of indicators by mathematical application of a group of supervisory data to a corresponding group of indicators; means for selecting an error recovery action based on said modified indicators to be executed from a group of error recovery actions.
 19. The arrangement according to claim 18 wherein said creating means comprises means for eliminating certain concurrent indicators which are redundant to other concurrent indicators thereby inhibiting the execution of error recovery actions associated with said certain indicators.
 20. The arrangement according to claim 18 wherein said creating means comprises means for establishing an order of execution of a plurality of error recovery actions based on a prioritization of indicators within one hierarchal level.
 21. The arrangement according to claim 19 wherein said creating means further comprises means for prioritizing said other indicators so that an order of execution of corresponding error recovery actions is established.